Fast Construction of a Word-Number Index for Large Data

Authors

  • Miloš Jakubíček
  • Pavel Rychlý
  • Pavel Šmerk
Abstract

This paper presents work in progress with promising results. We propose a new method for constructing word-to-number and number-to-word indices for very large corpus data (tens of billions of tokens) that is up to an order of magnitude faster than the current approach. We use a HAT-trie to sort the data and Daciuk's algorithm to build a minimal deterministic finite-state automaton from the sorted data. We reimplemented the latter; our implementation is roughly three times faster than Daciuk's and has a smaller memory footprint. This is useful not only for building word↔number indices but also for many other applications, e.g. building data for morphological analysers.
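The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the incremental construction for sorted input (in the style of Daciuk's algorithm) and adds per-state word counts so that the automaton serves as a word↔number index. That counting scheme is an assumption, since the abstract does not say how numbers are assigned. The built-in `sorted()` stands in for the HAT-trie sort, and all names (`State`, `build_minimal_dfa`, `word_to_number`, `number_to_word`) are hypothetical.

```python
class State:
    __slots__ = ("edges", "final", "count")

    def __init__(self):
        self.edges = {}     # char -> State; keys are inserted in sorted order
        self.final = False  # True if some word ends in this state
        self.count = 0      # number of words accepted starting from this state


def _replace_or_register(state, register):
    """Minimize the most recently added path hanging below `state`."""
    ch = next(reversed(state.edges))  # last inserted edge = largest character
    child = state.edges[ch]
    if child.edges:
        _replace_or_register(child, register)
    child.count = int(child.final) + sum(t.count for t in child.edges.values())
    # All children of `child` are already canonical, so their ids identify them.
    signature = (child.final, tuple((c, id(t)) for c, t in child.edges.items()))
    state.edges[ch] = register.setdefault(signature, child)


def build_minimal_dfa(sorted_words):
    """Build a minimal acyclic DFA from strictly increasing words."""
    root, register, prev = State(), {}, None
    for word in sorted_words:
        if prev is not None and word <= prev:
            raise ValueError("input must be sorted and free of duplicates")
        state, i = root, 0
        # Follow the prefix shared with the previously added word.
        while i < len(word) and word[i] in state.edges:
            state = state.edges[word[i]]
            i += 1
        if state.edges:
            _replace_or_register(state, register)
        for ch in word[i:]:  # append the remaining suffix as fresh states
            state.edges[ch] = State()
            state = state.edges[ch]
        state.final = True
        prev = word
    if root.edges:
        _replace_or_register(root, register)
    root.count = int(root.final) + sum(t.count for t in root.edges.values())
    return root


def word_to_number(root, word):
    """Return the rank of `word` in sorted order, or None if it is absent."""
    state, index = root, 0
    for ch in word:
        if ch not in state.edges:
            return None
        if state.final:
            index += 1  # a shorter word ends here and sorts first
        for c, target in state.edges.items():
            if c == ch:
                break
            index += target.count  # skip every word under a smaller character
        state = state.edges[ch]
    return index if state.final else None


def number_to_word(root, number):
    """Return the word with rank `number`, or None if out of range."""
    if not 0 <= number < root.count:
        return None
    state, chars = root, []
    while True:
        if state.final:
            if number == 0:
                return "".join(chars)
            number -= 1
        for c, target in state.edges.items():
            if number < target.count:
                chars.append(c)
                state = target
                break
            number -= target.count
```

For example, `root = build_minimal_dfa(sorted(vocabulary))` followed by `word_to_number(root, w)` gives the position of `w` in the sorted vocabulary, and `number_to_word(root, n)` inverts it. Because equivalent suffix states are merged as soon as a path can no longer change, memory use depends on the size of the minimal automaton rather than on the size of a full trie.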

Similar resources

Automatic Construction of Persian ICT WordNet using Princeton WordNet

WordNet is a large lexical database of the English language in which nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets). Each synset expresses a distinct concept. Synsets are interlinked by both semantic and lexical relations. WordNet is essentially used for word sense disambiguation, information retrieval, and text translation. In this paper, we propose s...

A Comparative Study of Multipole and Empirical Relations Methods for Effective Index and Dispersion Calculations of Silica-Based Photonic Crystal Fibers

In this paper, we present a solid-core silica-based photonic crystal fiber (PCF) composed of a hexagonal lattice of air holes and calculate the effective index and chromatic dispersion of the PCF for different physical parameters using the empirical relations method (ERM). These results are compared with the data obtained from the conventional multipole method (MPM). Our simulation results reveal tha...

The effect of Yazd-Eghlid railway construction on diversity and richness of shrub and Bush-tree rangelands in Yazd province

Railway construction is one of the important activities in the development of any country, and in developing countries the need for roads is one of the main axes of development. Railway construction operations can affect the desert rangelands around the railway. This study investigates the effects of Yazd-Eghlid railway construction on vegetation diversity and richness in the rangelands of Kalmand-...

The Effect of Observation Data Sampling Methods on Infiltration Areas by Maximum Entropy Model

Statistical modeling methods are based on multivariate regression methods and require the presence and absence location of data for the construction of the model. In most cases, there is no trustworthy absence data. Therefore, other methods that are based only on the presence of the phenomenon are used. Considering the importance of modeling - saving time and cost and the probable prediction of...

A Novel Multicast Tree Construction Algorithm for Multi-Radio Multi-Channel Wireless Mesh Networks

Many appealing multicast services such as on-demand TV, teleconferencing, and online games can benefit from the high available bandwidth in multi-radio multi-channel wireless mesh networks. When multiple simultaneous transmissions use a similar channel to transmit data packets, network performance degrades to a large extent. Designing a good multicast tree to route data packets could enhance the...

Publication date: 2013